Very many variables and limited numbers of observations; The p>>n problem in current statistical applications
Abstract
New technologies have led to an “explosion” of data available to document states and processes in very many fields. Tools of data mining are being used to extract relevant information. If this information is used in decision making, analytical statistics can provide formal tests comparing the outcomes of different scenarios. Statistics has traditionally dealt with limited information, both in terms of observations and in terms of the number of variables explaining the states of these observations. Virtually all statistical hypothesis testing was developed for such scenarios, trying to make sense of limited data that are often expensive to produce. Clinical trials, and the drug development steps that precede them, are typical examples from human medicine. Information about the genomes of individuals from DNA is becoming cheaper at an extremely fast rate. The DNA of humans and many animal species is composed of ~3,000,000,000 base pairs (nucleotides), arranged in 20-40 pairs of chromosomes. Chips extracting genetic variation from 500,000 to 2,500,000 genomic markers cost $100-300. Pharmacogenetics and pharmacogenomics refer to genetic differences in metabolic pathways which can affect individual responses to drugs, in terms of both therapeutic and adverse effects. Clearly, linking genomic information with the outcomes of a clinical trial yields a p>>n problem: the number of variables (p) far exceeds the number of observations (n). We will link genomic information from human and livestock species to phenotypes in order to elucidate developments in statistics dealing with the p>>n problem. In this context, the ability to predict outcomes is often as important as the ability to understand the cause-effect relationship. Classical statistics is augmented with data mining tools. We will learn about different types of variable selection procedures that try to extract the most important explanatory variables, and we will also deal with multivariate black-box approaches.
From this perspective, we will look at similar scenarios in other fields of biology, ecology and economics.
doi:10.2498/iti.2012.0486
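To make the p>>n setting described in the abstract concrete, the following is a minimal sketch on simulated data, not a method taken from the paper. The sample sizes, the 0/1/2 genotype coding, the number of causal markers and the ridge penalty `lam` are all illustrative assumptions. It shows that with more markers than individuals the least-squares problem has no unique solution, and that a penalized fit (ridge regression, used here as one possible stand-in for the regularized approaches the abstract alludes to) remains solvable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 50 individuals (n), 1000 genomic markers (p).
n, p = 50, 1000

# Simulated genotypes coded 0/1/2, centred per marker (an assumption).
X = rng.integers(0, 3, size=(n, p)).astype(float)
X -= X.mean(axis=0)

# Simulated phenotype: 5 arbitrarily chosen causal markers plus noise.
beta_true = np.zeros(p)
causal = rng.choice(p, size=5, replace=False)
beta_true[causal] = 2.0
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# With p > n the rank of X is at most n, so X'X (p x p) is singular
# and ordinary least squares has infinitely many solutions.
rank = np.linalg.matrix_rank(X)

# Ridge regression adds lam * I, which makes the system solvable.
# The dual form needs only an n x n solve:
#   beta = X' (X X' + lam * I_n)^{-1} y
lam = 10.0  # illustrative penalty, not tuned
beta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

# A crude ranking of markers by absolute coefficient size.
top_markers = np.argsort(-np.abs(beta_ridge))[:10]

print("rank of X:", rank, "with n =", n, "and p =", p)
print("simulated causal markers:", sorted(causal.tolist()))
print("top-ranked markers:", sorted(top_markers.tolist()))
```

Ranking by coefficient size is only a rough illustration of variable selection; in practice the penalty would be chosen by cross-validation, and sparsity-inducing methods would usually be preferred when the goal is to pick out individual markers.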
Similar resources
An Overview of the New Feature Selection Methods in Finite Mixture of Regression Models
Variable (feature) selection has attracted much attention in contemporary statistical learning and recent scientific research. This is mainly due to the rapid advancement in modern technology that allows scientists to collect data of unprecedented size and complexity. One type of statistical problem in such applications is concerned with modeling an output variable as a function of a sma...
Feature Extraction to Identify Network Traffic with Considering Packet Loss Effects
There are huge petitions of network traffic coming from various applications on Internet. In dealing with this volume of network traffic, network management plays a crucial role. Traffic classification is a basic technique which is used by Internet service providers (ISP) to manage network resources and to guarantee Internet security. In addition, growing bandwidth usage, on the one hand, and limit...
Use of Fuzzy Numbers for Assessing Problem Solving Skills
The importance of Problem Solving (PS) has been realized for such a long time that in a direct or indirect way affects our daily lives in many ways. Assessment cases appear frequently for PS skills which involve a degree of uncertainty and (or) ambiguity. Fuzzy logic, due to its nature of characterizing such cases with multiple values, offers rich resources for dealing with them. On ...
Nonparametric Shewhart-type Quality Control Charts in Fuzzy Environment
Nonparametric control charts are presented in order to figure out the problem of detecting changes in the process median (or mean), or changes in the variability process where there is limited knowledge regarding the underlying process. When observations are reported imprecise, then it is impossible to use classical nonparametric control charts. This paper is devoted to the problem of c...
Geometric Programming Problem with Trapezoidal Fuzzy Variables
Nowadays Geometric Programming (GP) problem is a very popular problem in many fields. Each type of Fuzzy Geometric Programming (FGP) problem has its own solution. Sometimes we need to use the ranking function to change some part of GP to the linear one. In this paper, first, we propose a method to solve multi-objective geometric programming problem with trapezoidal fuzzy variables; then we use ...